Optimal Stem Identification in Presence of Suffix List
Abstract
Stemming is considered crucial in many NLP and IR applications. In the absence of any linguistic information, stemming is a challenging task; stemming words with the help of a language's suffix list is, by comparison, an easier problem. In this work we treat stemming as the problem of obtaining the smallest possible lexicon from an unannotated corpus by using a suffix set. We prove that the exact lexicon reduction problem is NP-hard and give a polynomial-time approximation. We also propose a probabilistic model for stemming that minimizes the stem distributional entropy. The performance of these models is analyzed using an unannotated corpus and a suffix set of Malayalam, a morphologically rich language of India belonging to the Dravidian family.
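To make the lexicon-reduction view concrete, the following is a minimal sketch that treats stem selection as a set-cover-style problem and solves it greedily. It is not the approximation algorithm from the paper, which the abstract does not spell out. The function names `candidate_stems` and `greedy_stem_cover`, the tie-breaking rule, and the toy English word list are all assumptions made for illustration.

```python
from collections import defaultdict


def candidate_stems(word, suffixes):
    """Return the word itself plus every non-empty remainder left
    after stripping one suffix from the (assumed) suffix list."""
    stems = {word}
    for suf in suffixes:
        if suf and word.endswith(suf) and len(word) > len(suf):
            stems.add(word[: -len(suf)])
    return stems


def greedy_stem_cover(words, suffixes):
    """Greedy heuristic: repeatedly pick the candidate stem that
    accounts for the most still-unassigned words.

    Returns a dict mapping each word to its chosen stem.
    Ties are broken by stem length, then lexicographically
    (an arbitrary but deterministic choice for this sketch).
    """
    covers = defaultdict(set)  # stem -> words it can explain
    for w in set(words):
        for s in candidate_stems(w, suffixes):
            covers[s].add(w)

    uncovered = set(words)
    assignment = {}
    while uncovered:
        best = max(covers, key=lambda s: (len(covers[s] & uncovered), len(s), s))
        newly = covers[best] & uncovered
        for w in newly:
            assignment[w] = best
        uncovered -= newly
    return assignment


if __name__ == "__main__":
    # Toy data, not from the paper's Malayalam corpus.
    words = ["walk", "walks", "walked", "walking", "talk", "talks"]
    suffixes = ["s", "ed", "ing"]
    result = greedy_stem_cover(words, suffixes)
    print(sorted(set(result.values())))  # ['talk', 'walk']
```

Every word is always a candidate stem of itself, so the loop terminates. On the toy input, six word forms collapse to the two stems `walk` and `talk`.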
Similar resources
A Framework for Learning Morphology using Suffix Association Matrix
Unsupervised learning of morphology is used for automatic affix identification, morphological segmentation of words and generating paradigms which give a list of all affixes that can be combined with a list of stems. Various unsupervised approaches are used to segment words into stem and suffix. Most unsupervised methods used to learn morphology assume that suffixes occur frequently in a corpus...
Little by Little: Semi Supervised Stemming through Stem Set Minimization
In this paper we take an important step towards completely unsupervised stemming by giving a scheme for semi supervised stemming. The input to the system is a list of word forms and suffixes. The motivation of the work comes from the need to create a root or stem identifier for a language that has electronic corpora and some elementary linguistic work in the form of, say, suffix list. The scope...
Phonetic Reflexes of Morphological Boundaries at a Normal Speech Rate
Our production experiment in Scottish English revealed that the duration of a rhyme immediately followed by a Level II suffix such as –s (the 3rd person singular/plural/possessive suffix) and –t (the past tense suffix) was significantly longer than that of a monomorphemic counterpart. Such a durational difference between suffixed forms and monomorphemic forms was absent when the Level II suffix w...
Learning Word Segmentation Rules for Tag Prediction
In our previous work we introduced a hybrid, GA&ILP-based approach for learning of stem-suffix segmentation rules from an unmarked list of words. Evaluation of the method was made difficult by the lack of word corpora annotated with their morphological segmentation. Here the hybrid approach is evaluated indirectly, on the task of tag prediction. A pair of stem-tag and suffix-tag lexicons is obt...
Pharmacognostic study of Argyreia pilosa stem
Background and objectives: Argyreia pilosa (Convolvulaceae) has been used ethnomedicinally in the conventional system for many ailments, most significantly against sexually transmitted diseases, skin troubles, diabetes, rheumatism, cough, and quinsy. The key challenge experienced in the standardization of herbal drugs is the correct identification of the plant sour...